Denoising diffusion using Stochastic Differential Equations

LIX Seminar

Reminders: score modeling

The score function of a probability distribution with density \(p(x)\) is the gradient of the log-density: \[ \nabla_x \log p(x) \]

Working with this quantity has several advantages: - Bypasses the normalizing constant problem - Allows for sampling using Langevin algorithm - It is possible to learn \(s_\theta(x) \sim \nabla_x \log p(x)\)

Evolution of the score modeling approach

Initial proposal: score matching(Hyvärinen and Dayan 2005)

  • The quantity \(Trace\left(\nabla_x^2 \log p_\theta(x)\right)\) is difficult to compute
  • The score is untractable in low density areas

Given \(\{x_1, x_2, ..., x_T\} \sim p_\text{data}(x)\) Objective: Minimize the quantity \[ E_{p(x)}\left[\frac{1}{2}|| \log p_{\theta}(x)||² + Trace\left(\nabla_x^2 \log p_\theta(x)\right)\right]\]

 

Evolution of the score modeling approach

Initial proposal: score matching(Hyvärinen and Dayan 2005)

Learning the score of a noisy distribution(Vincent 2011)

No score of noise-free distribution

 

Evolution of the score modeling approach

Initial proposal: score matching(Hyvärinen and Dayan 2005)

Learning the score of a noisy distribution(Vincent 2011)

Denoising diffusion models(Sohl-Dickstein et al. 2015), annealed Langevin dynamics(Y. Song and Ermon 2019)

  • Gradually decrease noise in the distribution
  • Multiple variants of the DDPM algorithm we know

Annealed Langevin sampling

Evolution of the score modeling approach

Initial proposal: score matching(Hyvärinen and Dayan 2005)

Learning the score of a noisy distribution(Vincent 2011)

Denoising diffusion models(Sohl-Dickstein et al. 2015), annealed Langevin dynamics(Y. Song and Ermon 2019)

DDPM beats GAN(Ho, Jain, and Abbeel 2020)!

 

What is DDPM?

  • Forward process
  • Backward process
  • A denoiser \(\epsilon_\theta(., t)\), parameterizing the score function
  • A simple training objective \(L_\text{simple}=\left|\left|\epsilon-\epsilon_\theta\left(\underbrace{\sqrt{\bar{\alpha_t}}x_0+\sqrt{1-\bar{\alpha_t}}\epsilon}_{\text{Forward estimate of } x_t \text{ given } x_0},t\right)\right|\right|^2\)
  • This objective is equivalent to the score matching objective

Algorithm

Questions/Problems

  • Can we unify DDPM and other approaches in a common framework?
  • Number of timesteps \(T\) needs to be fixed before training
  • Can we fasten the sampling, ideally without needed re-training?
  • Can we model the data in a deterministic way using score modeling?

Proposed solution: Score modeling using Stochastic Differential Equations(Y. Song et al. 2021)!

Ordinary Differential Equations (ODE)

Equations of functions, of the form \(\frac{dx}{dt} = f(x, t)\) (order 1).

  • Unique solution for any initial condition \(x(t_0)\)
  • Geometric interpretation using vector fields

Stochastic diffential equations (SDE)

Equation of time-dependent stochastic processes, noted \(X_t\).

They are of the form

\(dx = \underbrace{f(x, t)dt}_{\text{"drift" term}} + \underbrace{g(t) dW_t}_{\text{"diffusion" term}}\),

where \(W_t\) is a “standard Wiener process” or Brownian motion.

They are used in many domains (finance, physics, biology, and even shape analysis)

Differences between SDE and ODE

  • Given an initial condition, an SDE has now multiple possible realizations!

  • The initial condition is now always \(x_0\), the time only goes Forward

  • Solving an SDE means looking for the trajectories denssity \(p_t(x)\)

Brownian motion

A stochastic process \(W_t\) is a Wiener process, or Brownian motion, if:

  • \(W_0 = 0\)
  • \(W_t\) is “almost surely” continuous
  • \(W_t\) has independent increments
  • \(W_t - W_s \sim \mathcal{N}(0, t-s), \text{ for any } 0 \leq s \leq t\)

If we sample 100 trajectories, we obtain:

Stochastic Differential Equation example

Properties of the SDE

Let the SDE be:

\[ dx = f(x, t)dt + g(t) d\bar{w} \]

  • Assuming some conditions on \(f(x, t)\) and \(g(t)\), the density \(p_t(x)\) at any time step is uniquely determined

  • Suppose we know, the score \(\nabla_x \log p_t(x)\) for all \(t\), then we can reverse the diffusion SDE using the following reverse-SDE:

\[ dx = \left[f(x,t) - g(t)² \nabla_x \log p_t(x)\right] dt + g(t) d\bar{w}, \]

where the time is flowing backwards, and \(\bar{w}\) is a backward Wiener process.

“Variance Exploding” SDE

This formulation is equivalent to denoising score matching of (Y. Song and Ermon 2019). We perturb the data by adding noise \(\tilde{x}_i \sim \mathcal{N}(x, \sigma_i² I)\). The forward process is such that

\[ x_i = x_{i-1} + \sqrt{\sigma_i² - \sigma_{i-1}²}z_{i-1}, z_{i-1} \sim \mathcal{N}(0, I) \]

Seen as the discretization of a continuous process, it becomes:

\[ \begin{align} x(t + \Delta t) & = x(t) + \sqrt{\sigma²(t+\Delta t) - \sigma²(t)} z(t) \\ & \simeq x(t) + \sqrt{\frac{d\left[\sigma²(t)\right]}{dt} \Delta t} z(t) \end{align} \]

The SDE formulates as \[ dx = \sqrt{\frac{d\left[\sigma²(t)\right]}{dt}} dw \]

“Variance preserving” SDE

This formulation is equivalent to the DDPM paper. The forward process of DDPM is given by: \[ x_i = \sqrt{1-\beta_i} x_{i-1} + \sqrt{\beta_i} z_i = \sqrt{1-\frac{\bar{\beta}_i}{N}} x_{i-1} + \sqrt{\frac{\bar{\beta}_i}{N}} z_i \]

Where \(\bar{\beta_i} = N \beta_i\).

Seen as the discretization of a continuous process, it becomes:

\[ \begin{align} x(t + \Delta t) & = \sqrt{1-\beta(t+\Delta t) \Delta t}x(t) + \sqrt{\beta(t+\Delta t) \Delta t} z(t) \\ & \simeq x(t) - \frac{1}{2} \beta(t+\Delta t) \Delta t x(t) + \sqrt{\beta(t+\Delta t) \Delta t} z(t) \text{ (Taylor expansion) } \\ & \simeq x(t) - \frac{1}{2} \beta(t) \Delta t x(t) + \sqrt{\beta(t) \Delta t} z(t) \end{align} \]

The SDE formulates as \[ dx = - \frac{1}{2} \beta(t) x dt + \sqrt{\beta(t)}dw \]

Comparisons of behavior of both SDEs

With original data distribution -> two diracs

 

The variance exploding has more curvature when approaching data \(\to\) DDPM obtains better samples with fewer steps.

Note: other samplers, such as DDIM(J. Song, Meng, and Ermon 2020) or EDM(Karras et al. 2022) can show even better curvature

Results

Probability flow ODE

Another remarkable result: Given the following SDE:

\[ dx = f(x, t)dt + g(t) d\bar{w}, \]

the following ODE :

\[ dx = \left[f(x, t) - g(t) \nabla_x \log p(x)\right] dt, \]

share the same marginals \(p_t(x)\) as the SDE solution, if \(p_0\) or \(p_T\) is given(Maoutsa, Reich, and Opper 2020).

Advantages of the ODE

Uniqueness: a data sample \(x(0)\) has a unique “latent” corresponding point \(x(T)\).

  • First consequence: “latent” manipulation of images

Note : We can’t manipulate the latents in a differentiable way. You would need a different approach, such as score distillation sampling or ControlNet

  • Second consequence: the choice of the model doesn’t matter

Advantages of the ODE

We can use well established solvers. It reduces the number of evaluation steps

Samples of images, with different number of network evaluations

Compared to stochastic sampling, it reduces the number of needed evaluations from \(\mathcal{O}(d)\) to \(\mathcal{O}(\sqrt{d})\), where \(d\) is the dimension of the data(Chen et al. 2023).

Summary

Findings of the paper:

  • Most approach of score modeling can be unified in a common framework, including DDPM
  • This framework allows for better sample quality and Higher resolution for images
  • The ODE formulation allows for high sampling speed, without needing retraining

Potential follow-up works

References

Chen, Sitan, Sinho Chewi, Holden Lee, Yuanzhi Li, Jianfeng Lu, and Adil Salim. 2023. “The Probability Flow ODE Is Provably Fast.” In Thirty-Seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=KD6MFeWSAd.
Ho, Jonathan, Ajay Jain, and Pieter Abbeel. 2020. “Denoising Diffusion Probabilistic Models.” Advances in Neural Information Processing Systems 33: 6840–51.
Hyvärinen, Aapo, and Peter Dayan. 2005. “Estimation of Non-Normalized Statistical Models by Score Matching.” Journal of Machine Learning Research 6 (4).
Karras, Tero, Miika Aittala, Timo Aila, and Samuli Laine. 2022. “Elucidating the Design Space of Diffusion-Based Generative Models.” In Proc. NeurIPS.
Maoutsa, Dimitra, Sebastian Reich, and Manfred Opper. 2020. “Interacting Particle Solutions of Fokker–Planck Equations Through Gradient–Log–Density Estimation.” Entropy 22 (8): 802.
Poole, Ben, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. 2022. “DreamFusion: Text-to-3D Using 2D Diffusion.” arXiv.
Sohl-Dickstein, Jascha, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. “Deep Unsupervised Learning Using Nonequilibrium Thermodynamics.” In International Conference on Machine Learning, 2256–65. PMLR.
Song, Jiaming, Chenlin Meng, and Stefano Ermon. 2020. “Denoising Diffusion Implicit Models.” arXiv:2010.02502. https://arxiv.org/abs/2010.02502.
Song, Yang, and Stefano Ermon. 2019. “Generative Modeling by Estimating Gradients of the Data Distribution.” Advances in Neural Information Processing Systems 32.
Song, Yang, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2021. “Score-Based Generative Modeling Through Stochastic Differential Equations.” In International Conference on Learning Representations. https://openreview.net/forum?id=PxTIG12RRHS.
Vincent, Pascal. 2011. “A Connection Between Score Matching and Denoising Autoencoders.” Neural Computation 23 (7): 1661–74.
Zhang, Lvmin, Anyi Rao, and Maneesh Agrawala. 2023. “Adding Conditional Control to Text-to-Image Diffusion Models.”